ABSTRACT

We propose a comprehensive metrics validation methodology that has
six validation criteria, each of which supports certain quality
functions. New criteria are defined and illustrated, including
consistency, discriminative power, tracking and repeatability. We show
that non-parametric statistical methods play an important role in
evaluating metrics against the validity criteria. A detailed example
of the application of the methodology is presented.

keywords: metrics validation methodology, validity criteria, quality 
functions, non-parametric statistical methods. 

INTRODUCTION 

If the software engineering community believes that the field of 
metrics should be engineering and not art, then it should subscribe to
the idea that we evaluate (validate) whether metrics measure what they 
purport to measure prior to using the metrics. Furthermore, if metrics 
are to be of greatest utility, the validation should be performed in 
terms of the quality functions (quality assessment, control and 
prediction) that the metrics are to support. 

Our purpose is to propose and illustrate a validation methodology 
whose adoption, we believe, would provide a rational basis for using
metrics. We believe this to be the most comprehensive metrics
methodology ever proposed. There have been useful validation analyses
performed on specific metrics or metric systems for the purpose of 
satisfying specific research goals. Among these validations are the 
following: 1) function points as a predictor of work hours across
different development sites and sets of data [2]; 2) reliability of
metrics data reported by programmers [3]; 3) Halstead operator count
for Pascal programs [7]; 4) metric-based classification trees [18]; 5) 
evaluation of metrics against syntactic complexity properties [19]. 
Our approach to validation differs in the following ways: 1) The 
methodology is general and not specific to particular metrics or
research objectives. 2) It is developed from the point of view of the 
metric user (rather than the researcher), who has requirements for 
assessing, controlling and predicting quality. To illustrate the 
difference in viewpoint, we can make an analogy with the automobile 
industry: the manufacturer has an interest in brake lining thickness, 
as it relates to stopping distance, but from the driver's perspective, 
the only meaningful metric is stopping distance! 3) It consists of six 
mathematically defined criteria, each of which is keyed to a metrics 
function, so the user of metrics can understand how a characteristic 
of a metric, as revealed by validation tests, can be applied to 
measure software quality. 4) It includes new criteria: consistency, 
discriminative power, tracking and repeatability. 5) It recognizes
that a given metric can have multiple uses (e.g., assess, control and
predict quality) and that a given metric can be valid for one use and
invalid for another use. 6) It includes some useful statistical
methods, rarely seen in the metrics literature, that are applied to 
metrics validation: partial linear correlation analysis, chi-square 
test for differences in probabilities (contingency tables), 
discriminant analysis and runs test. 

It is not our purpose to be a proponent or an opponent of given
metrics. Whether certain metrics pass or fail our validity tests in 
the examples is not the point of this paper. The examples are for the 
sole purpose of illustrating the application of the validation 
methodology. The validation results could be different in other 
applications and environments. 

We emphasize the use of non-parametric statistical techniques for 
metrics validation because: 1) their application is more consistent 
with the nature of metrics data (e.g., non-linearity, non-normality, 
large variability) than are parametric techniques and 2) the measures 
that result from their application are useful for quality assessment 
and control. 

Outline of Paper 

The following subjects are covered:

o Definitions.
o Rationale of Metrics Validation.
o Quality Functions.
o Non-parametric Statistical Methods for Metrics Validation.
o Purpose of Metrics Validation.
o Validity Criteria.
o Example of Metrics Validation.
o Summary and Future Research.



DEFINITIONS 



Critical Value       Metric value of a validated metric which is
                     used to identify software which has
                     unacceptable quality [11].

Quality Assessment   Evaluation of the relative quality of software
                     components.

Quality Attribute    A feature or characteristic that affects an
                     item's quality [13].

Quality Control      A set of activities designed to evaluate the
                     quality of developed components [modification
                     of 13].

Quality Factor       An attribute of software that contributes to
                     its quality [11]. A quality factor is also a
                     metric.

Quality Metric       A function whose inputs are software data and
                     whose output is a single (numerical) value that
                     can be interpreted as the degree to which
                     software possesses a given attribute that
                     affects its quality [13].

Quality Prediction   A forecast of component quality.

Quality Requirement  A requirement that a software attribute be
                     present in software to satisfy a contract,
                     standard, specification, or other formally
                     imposed document [11].

Software Component   General term used to refer to an element of a
                     software system, such as module, unit, data or
                     document [11].

Software Quality     The degree to which software possesses a
                     desired combination of attributes [12].

Validated Metric     A metric whose values have been statistically
                     associated with corresponding quality factor
                     values [11].

For simplicity of expression, terms will be used without the
qualifying word ('metric' instead of 'quality metric') in the
remainder of the paper except in the case of 'quality factor', which
will be used to distinguish it from the 'factor' of the statistical
method 'factor analysis'.



RATIONALE FOR METRICS VALIDATION 

To help ensure that metrics are used appropriately, only validated
metrics (i.e., either quality factors or metrics validated with
respect to quality factors) should be used. Quality factors are valid
by definition. Furthermore, the metrics which are used should be those
which are associated with the quality requirements of the software
project. Both product and process metrics are used to assess software 
quality. Our statements about product elements (i.e., components) 
apply equally to the processes which produce the products. 

It should be understood that if a metric is validated according to 
our criteria, there is no guarantee that it will faithfully represent 
a quality factor when applied. Validation is a statistical concept. As 
such, validation can only be performed within statistical error 
limits. The major benefit of validation is that it increases the 
probability that the metric will be a good indicator of quality. 

QUALITY FUNCTIONS 

Metrics are applied in three major quality functions: Quality 
Assessment, Quality Control and Quality Prediction. If metrics are to 
aid in making decisions about software quality, the user of metrics 
must understand how this tool supports major quality functions in a 
software engineering organization. Since metrics should not be 
validated unless the applications of metrics are clearly understood, 
it is worthwhile to describe the role of metrics during various 
software phases and the need to validate the metrics for specific 
metrics functions (i.e., the relationship must be made between
quality functions and validity criteria). Otherwise, a correlation
coefficient of .9 between metric X and quality factor Y, for example,
is only an abstraction. It only has meaning if validated in the
context of quality functions. These purposes are best served by
introducing validity criteria on a qualitative basis now; later,
mathematical definitions will be provided in the validation section.

QUALITY ASSESSMENT 

Associativity 

Software managers need a rational basis for allocating personnel 
and computer resources to inspection, testing, and other quality 
activities. A method for doing this is to use metrics to provide a 
measure of relative quality across components. For example, the 
magnitudes of a metric are used to establish priority of testing and 
allocation of budget and effort to testing (i.e., the 'worst' 
component would receive the most attention, largest budget and most 
staff). One way to assess relative quality is as follows: 

If the elements of a metric vector M, corresponding to components
1, 2, ..., n, are ordered by magnitude, as shown below, does this imply
an ordering of component quality?

Magnitude[M1 > M2 > ... > Mn] => Monotonically Increasing (Decreasing)
Quality?

The validity criterion which assesses the degree to which this 
relationship is satisfied is called associativity. A metric that is 
validated according to this criterion is used to compare magnitudes of 
a metric obtained from different components to estimate the degree to 
which they differ in quality (e.g., 'the quality of Component 2 is 
twice that of Component 1').

Consistency 

It may be that the software manager is only interested in whether 
'Component 2 is better than Component 1' rather than how much better.
This approach has the advantage of not requiring a linear relationship 
between quality factors and metrics in order to have perfect 
association (e.g., if a factor varies as the cube of a metric, there 
is still perfect association). Thus, rank is the basis of comparison. 
Therefore, a second way to assess relative quality is as follows: 

If the elements of a metric vector M, corresponding to components
1, 2, ..., n, are ordered by rank, as shown below, does this imply an
ordering of component quality?

Rank[M1 > M2 > ... > Mn] => Monotonically Increasing (Decreasing)
Quality?

The validity criterion which assesses the degree to which this 
relationship is satisfied is called consistency. A metric that is 
validated according to this criterion is used to compare ranks of a 
metric obtained from different components to order the quality of a 
set of components. 

QUALITY CONTROL

Discriminative Power

Metrics are used to monitor the condition of a component to
determine whether the component appears to be out of tolerance. This
is defined to be a component whose quality is below standard. This
implies that critical values of metrics must be established prior to
the monitoring activity for comparing against the measured values
derived from the component.

In order to control quality during the design phase, components are
identified which appear to have unacceptable quality. Unacceptable
quality may be manifested as excessive complexity, inadequate
documentation, lack of traceability, or other undesirable attributes.
The existence of such conditions is an indication that the software
may not satisfy quality requirements when it becomes operational.
Since many of the quality factors which are usually of interest (e.g.,
reliability) cannot be measured during design, and are only available
during test and operation, validated metrics are used when quality
factors are not available. Validated metric measurements are compared
with the critical values of the metrics. Components whose measurements
are greater than (or less than) the critical values are flagged for
detailed inspection. Depending on the results of the inspection,
components are redesigned, scrapped, or not changed. The fact that a
measurement is outside the critical value does not necessarily mean
that the component will exhibit unacceptable quality during operation;
rather, it is a warning that the condition bears investigation. This
concept is illustrated in Figure 1 for metric vector M for components
1, 2, ..., n. The role of metrics validation for this use of quality
control is to identify a critical value of a metric, where that metric
has been validated against a quality factor on a previous similar
project. Then the metric can serve as a substitute to identify
unacceptable quality during design. Such a metric satisfies the
discriminative power validity criterion.




[Figure 1: metric values during the design phase plotted against
project time; the critical value of the metric divides the plot into
an unacceptable region and an acceptable region.]

Figure 1. Application of Metrics to Quality Control (discriminative
power)



Tracking 



In addition to component quality lying within acceptable bounds, a 
desirable condition is for quality to improve over the life of the 
component (i.e., a component should exhibit quality growth). Thus, 
during all phases of the life of the component we wish to track 
quality in order to control quality. That is, we want to know whether 
the software is getting better, worse, or staying the same. Again, in 
most phases, the quality factor will not be available but we must know 
how quality might be changing, nevertheless. This concept is 
illustrated in Figure 2 for metric vector M for a given component i, 
measured at times T1, T2, ..., Tn. In this illustration, quality
increases from T1 to T2, stays the same from T2 to T3, and decreases
from T3 to Tn, assuming high metric values are 'bad'. Here, the
question for metrics validation is whether a metric can be identified 
whose changes over time will track changes in quality. In particular, 
if a metric has been validated as tracking a quality factor on a 
previous similar project, it would serve as a substitute for tracking 
quality on the given project. Such a metric satisfies the tracking 
validity criterion. 



[Figure 2: metric M for component i plotted against project time.]

Figure 2. Application of Metrics to Quality Control (tracking)

QUALITY PREDICTION

Predictability

During the design phase validated metrics are used to make 
predictions of test or operational phase quality factors. Predicted 
values of quality factors are compared with target values. Components 
whose predicted quality factor values are greater than (or less than) 
the target values are flagged for detailed inspection. Potentially, 
prediction is more valuable than assessment and control because it 
estimates the attribute of ultimate interest -- the quality factor. 
However, prediction is more difficult because it involves using 
validated metrics from an early phase (e.g., design) to make 
predictions about a different but related attribute (quality factor) 
in a much later phase (e.g., operations). This concept is illustrated 
in Figure 3 where, at time T1, metric M is used to predict the factor
Fp at time T2, for a given component, and Fa is eventually observed as
the actual value at T2. The challenge to metrics validation is to find
a metric or metrics that can predict a quality factor with acceptable
accuracy. Such a metric satisfies the predictability validity
criterion.

[Figure 3: metric M measured at T1 (Design) is used to predict the
quality factor at T2 (Operations) through F = f(M, ...).]

Figure 3. Application of Metrics to Quality Prediction (predictability)

NON-PARAMETRIC STATISTICAL METHODS FOR METRICS VALIDATION 

Among the advantages of non-parametric statistical methods over
parametric methods [5,6,8] which are important for metrics validation
are the following:

o Assumptions less restrictive than with parametric methods. Given the
noisiness of metrics data, this is a big plus.

o No assumption about distribution (e.g., data does not have to be
normally distributed).

o Can use ordinal scale (i.e., component A is higher quality than
component B).

o Can use nominal scale (i.e., A is high quality; B is low quality).

o Do not need interval scale (i.e., difference between A quality and B
quality).

o Do not need ratio scale (i.e., A is 2.5 times the quality of B).

For example, ranks of random variables [3] can be used rather than 
the values themselves, thus relaxing the assumptions about data 
relationships (e.g., linearity) while providing a measure of quality 
(e.g., ranking of components) that is useful to the software manager. 
In other words, the fact that the data is not as 'well-behaved' as we
might believe it should be does not necessarily mean that it is less
useful. In fact, when we consider that many useful applications of
metrics can be derived from the ability to classify components as
being 'better' or 'worse', 'high quality' or 'low quality', 'acceptable'
or 'unacceptable', we realize that the information provided by
non-parametric analysis is supportive of this approach.

Multivariate statistical methods (e.g., correlation analysis, 
factor analysis) are also used where appropriate. 

PURPOSE OF METRICS VALIDATION

The purpose of metrics validation is to identify metrics that are
related to quality factors. If metrics are to be useful, they must
indicate accurately whether quality requirements have been achieved or
are likely to be achieved in the future. When it is possible to
measure quality factors at the desired point in the life of the
software, they are used to evaluate software quality. At other points,
certain quality factors (e.g., reliability) are not available; they
are obtained after delivery or late in the project. In these cases,
metrics are used early in a project to assess, control and predict
quality.

It is important that metrics be validated before they are used to 
evaluate software quality. Otherwise, metrics may be misapplied (i.e., 
metrics may be used that have little or no relationship to the desired 
quality characteristics). 

VALIDITY CRITERIA 

To be considered valid, a metric must demonstrate a high degree of 
association with the quality factor it represents. A metric may be 
valid with respect to certain validity criteria and invalid with 
respect to other criteria. 

The validation procedure requires that threshold values of validity
criteria be selected. These are the values V, B, A, and P,
which are described below. The criterion used for selecting these
values is reasonableness (i.e., judgement must be exercised in 
selecting values to strike a balance between the one extreme of 
causing a metric which has a high degree of association with a quality 
factor to fail validation and the other extreme of allowing a metric 
of questionable validity to pass validation). 

A short numerical example follows the definition of each validity
criterion.

Note: As previously stated, there are many advantages to using the 
general class of non-parametric statistical methods for metrics 
validation. However, although the specific methods that are associated 
with each validity criterion are appropriate, they are not necessarily 
the only methods that could be used. 

Associativity: The variation in the quality factor
explained by the variation in the metric, which is given by
the square of the linear correlation coefficient ($R^2$) between
the metric and the corresponding quality factor, must exceed
V ($R^2 > V$).

This criterion assesses whether there is a sufficiently strong
linear association between a quality factor and a metric to warrant
using the metric as a substitute for the quality factor, when it is
infeasible to use the latter. This criterion supports the quality
assessment function. The multivariate statistical methods of linear
correlation and partial linear correlation analysis [15] can be used
for this test.

For example, the correlation coefficient between a complexity
metric and the quality factor reliability may be .9. The square of
this is .81. Thus 81% of the variation in the quality factor is
explained by the variation in the metric. If this relationship is
demonstrated over a representative sample of components, and if V has
been established as .7, one could conclude that the metric has the
ability to associate complexity with reliability and can be used to
compare magnitudes of complexity obtained from different components to
estimate the degree to which they differ in reliability.
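
The associativity check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `pearson_r` and `passes_associativity` and the sample data are hypothetical; only the rule that the squared correlation must exceed V, and the values R = .9 and V = .7, come from the text.

```python
import math

def pearson_r(metric, factor):
    """Sample linear correlation coefficient between two equal-length lists."""
    n = len(metric)
    mean_m = sum(metric) / n
    mean_f = sum(factor) / n
    s_mf = sum((m - mean_m) * (f - mean_f) for m, f in zip(metric, factor))
    s_mm = sum((m - mean_m) ** 2 for m in metric)
    s_ff = sum((f - mean_f) ** 2 for f in factor)
    if s_mm == 0 or s_ff == 0:
        raise ValueError("correlation is undefined for constant data")
    return s_mf / math.sqrt(s_mm * s_ff)

def passes_associativity(r, v=0.7):
    """Associativity holds when R squared exceeds the threshold V."""
    return r * r > v

# The worked example from the text: R = .9, so R^2 = .81 > V = .7.
example_passes = passes_associativity(0.9, v=0.7)

# Hypothetical, perfectly linear data used only to exercise pearson_r.
r_linear = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

The threshold is a parameter because the text leaves the choice of V to the judgement of the metric user.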

Consistency: If a quality factor vector F1, F2, ..., Fn,
corresponding to components 1, 2, ..., n, has the relationship F1 > F2
> ... > Fn, the corresponding metric vector must have the relationship
M1 > M2 > ... > Mn.

This criterion assesses whether there is consistency between the
ranks of the quality factor and the ranks of the metric for the same
set of components. Thus this criterion is used to determine whether a
metric can accurately rank, by quality, a set of components. This
criterion supports the quality assessment function. The non-parametric
statistical method Spearman Rank Correlation [3, 5, 6, 8] can be used for
this test.

For example, if the reliability of components A, B and C, as
measured by MTTF, is 1000, 1500 and 800 hours, respectively, and the
corresponding complexity metric values are 5, 3 and 7, where low
metric values are 'better' than high values, the ranks for reliability
and metric values, with '1' representing the 'highest' rank, are as
follows:

Component   Reliability Rank   Complexity Rank

B           1                  1
A           2                  2
C           3                  3

If this relationship is demonstrated over a representative sample
of components, one could conclude that the metric is consistent and
can be used to rank the quality of components.
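
The ranking example above can be reproduced with a short Python sketch. It assumes Spearman rank correlation is computed as the linear correlation of the ranks, with tied values given their average rank; the helper names `ranks` and `spearman` are hypothetical, and the MTTF and complexity values are those of components A, B and C in the text.

```python
import math

def ranks(values, higher_is_better):
    """Rank values so that 1 is the best; ties receive the average rank."""
    ordered = sorted(values, reverse=higher_is_better)
    return [(2 * ordered.index(v) + 1 + ordered.count(v)) / 2 for v in values]

def spearman(rank_x, rank_y):
    """Linear correlation of two rank lists (Spearman rank correlation)."""
    n = len(rank_x)
    mean_x = sum(rank_x) / n
    mean_y = sum(rank_y) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rank_x, rank_y))
    s_xx = sum((x - mean_x) ** 2 for x in rank_x)
    s_yy = sum((y - mean_y) ** 2 for y in rank_y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Components A, B, C from the example.
mttf = [1000, 1500, 800]       # higher MTTF is better
complexity = [5, 3, 7]         # lower complexity is better

reliability_rank = ranks(mttf, higher_is_better=True)
complexity_rank = ranks(complexity, higher_is_better=False)
rho = spearman(reliability_rank, complexity_rank)
```

Because both lists are ranked with 1 as the best value, identical rankings give a coefficient of 1, which is the perfectly consistent case described in the example.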



Discriminative Power: A metric must be able to discriminate between
high quality components (e.g., high MTTF) and low quality components
(e.g., low MTTF). For example, the set of metric values associated
with the former should be significantly higher (or lower) than those
associated with the latter.

This criterion assesses whether a metric is capable of separating a
set of high quality components from a set of low quality components.
This capability allows one to establish critical values for metrics
which can be used to identify components which may have unacceptable
quality. This criterion supports the quality control function. The
following non-parametric statistical methods can be used for this
validation test: Mann-Whitney Test [4, 5, 6, 8], chi-square test for
differences in probabilities (contingency tables) [5, 8] and the
Kruskal-Wallis Test [4, 5, 6, 8]. The multivariate statistical method
discriminant analysis [1, 15] can also be used.

For example, if all components with a complexity metric value of
>10 (critical value) have a MTTF of 1000 hours and all components with 
a complexity metric value equal to or less than 10 have a MTTF of 2000 
hours, and this difference is sufficient to pass the statistical 
tests, then the metric separates low from high quality components. If 
the ability to discriminate is demonstrated over a representative 
sample of software components, one could conclude that the metric can 
discriminate between low and high reliability components. 
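
One of the tests named above, the chi-square test on a contingency table, can be sketched as follows. This is an illustration only: the 2x2 layout (metric above or at most the critical value, by low or high quality), the function name `chi_square_2x2`, and the counts are hypothetical, and no continuity correction is applied. The constant 3.841 is the chi-square critical value for one degree of freedom at the .05 level.

```python
CHI2_CRITICAL_DF1_05 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = .05

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows: metric above / not above the critical value.
    Columns: low quality / high quality components.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one component")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Hypothetical counts: of 10 components above the critical value, 8 are low
# quality; of 10 components at or below it, 1 is low quality.
statistic = chi_square_2x2(8, 2, 1, 9)
discriminates = statistic > CHI2_CRITICAL_DF1_05
```

A statistic above the critical value indicates that the proportion of low quality components differs between the two sides of the metric's critical value.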

Tracking: If a metric M is directly related to a
quality factor F, for a given component, then a change in
a quality factor value from $F_{T1}$ to $F_{T2}$, at
times T1 and T2, must be accompanied by a change in metric
value from $M_{T1}$ to $M_{T2}$, which is in the same
direction (e.g., if F increases, M increases). If M is
inversely related to F, then a change in F must be
accompanied by a change in M in the opposite direction
(e.g., if F increases, M decreases).

This criterion assesses whether a metric is capable of tracking 
changes in quality over the life of a component. This criterion 
supports the quality control function. The following non-parametric 
statistical methods can be used for this validation test: Spearman 
Rank Correlation and Wald-Wolfowitz Runs Test (test for randomness)
[5,8]. 

For example, if a complexity metric is claimed to be a measure of 
reliability, then it is reasonable to expect a change in the 
reliability of a component to be accompanied by an appropriate change 
in metric value (e.g., if the component increases in reliability, the 
metric value should also change in a direction that indicates the 
component has improved). That is, if MTTF is used to measure 
reliability and is equal to 1000 hours during testing (T1) and 1500
hours during operation (T2), a complexity metric whose value is 8 in
T1 and 6 in T2, where 6 is 'better' than 8 (i.e., complexity has
decreased), is said to track reliability for this component. If this
relationship is demonstrated over a representative sample of
components, one could conclude that the metric can track reliability
(i.e., indicate changes in component reliability) over the life of the
component.
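
The direction-of-change rule in the tracking definition can be sketched as a simple check over successive measurements. This is a hedged illustration, not a substitute for the statistical tests named above: the function name `tracks` is hypothetical, and intervals in which the quality factor does not change are skipped, which is an assumption since the definition only speaks of changes in F.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def tracks(factor_values, metric_values, inverse=False):
    """True if every change in F is matched by a change in M in the
    expected direction (same direction, or opposite when inverse=True)."""
    steps = zip(factor_values, factor_values[1:], metric_values, metric_values[1:])
    for f_old, f_new, m_old, m_new in steps:
        change_f = sign(f_new - f_old)
        if change_f == 0:
            continue  # assumption: no requirement when F is unchanged
        expected = -change_f if inverse else change_f
        if sign(m_new - m_old) != expected:
            return False
    return True

# Example from the text: MTTF rises from 1000 to 1500 hours while the
# complexity metric falls from 8 to 6 (inverse relationship).
example_tracks = tracks([1000, 1500], [8, 6], inverse=True)
```

Applied across a sample of components, the proportion for which such a check succeeds could then be compared with the required success rate.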

Predictability: If a metric is used at time T1 to
predict a quality factor for a given component, it must
predict a related quality factor $F_{p,T2}$ with an
accuracy of:

$$\left| \frac{F_{a,T2} - F_{p,T2}}{F_{a,T2}} \right| < A$$

where $F_{a,T2}$ is the actual value of F at time T2.

This criterion assesses whether a metric is capable of predicting a
quality factor value with the required accuracy. It is simply a
relative error calculation [2,6] that takes into consideration the
time of measurement. The multivariate statistical methods of linear
regression, multiple linear regression, and non-linear regression can
be used for this analysis.

For example, if a complexity metric is used during design to
predict the reliability of a component during operation (T2) to be
1200 hours MTTF ($F_{p,T2}$) and the actual MTTF that is measured
during operation is 1000 hours ($F_{a,T2}$), then the error in
prediction is 200 hours, or 20%. If the acceptable prediction error
(A) is 25%, prediction accuracy is acceptable. If the ability to
predict is demonstrated over a representative sample of components,
one could conclude that the metric can be used as a predictor of
reliability. For example, prediction could be used during design to
identify those components that need to be improved.
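
The relative error calculation in the example can be written out as a small Python sketch. The function names `relative_error` and `is_predictable` are hypothetical, and the error is taken relative to the actual value, which is an assumption consistent with the example's 200 hours being 20% of the 1000 hours actually observed. The predicted value used below is likewise an assumption: any prediction 200 hours away from the actual value gives the same error.

```python
def relative_error(actual, predicted):
    """Absolute prediction error as a fraction of the actual value."""
    if actual == 0:
        raise ValueError("relative error is undefined when the actual value is 0")
    return abs(actual - predicted) / abs(actual)

def is_predictable(actual, predicted, a):
    """Predictability holds when the relative error is below threshold A."""
    return relative_error(actual, predicted) < a

# Example values: actual MTTF 1000 hours, prediction off by 200 hours
# (1200 assumed here), acceptable error A = 25%.
error = relative_error(1000, 1200)
acceptable = is_predictable(1000, 1200, a=0.25)
```

Keeping A as an argument lets the same check be rerun with the 20% threshold used later in the validation example.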

Repeatability: A metric must demonstrate the above associativity,
consistency, discriminative power, tracking, and predictability
properties for P percent of the applications of the metric.

This criterion is used to ensure that a metric has passed a
validity test over a sufficient number or percentage of applications
so that there will be confidence that the metric can perform its
intended function consistently.

For example, if the required 'success rate' (P) for validating a
complexity metric against the Predictability criterion has been
established as 80%, and there are 100 components, the metric must
predict the quality factor with the required accuracy for at least 80
of the components.
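
The success-rate rule can be sketched as a count over per-component outcomes. The function name `meets_repeatability` and the list-of-booleans representation are assumptions made for illustration; the threshold of 80% and the 100 components are those of the example.

```python
def meets_repeatability(outcomes, p_percent):
    """True if at least p_percent of the outcomes are passes.

    outcomes: list of booleans, one per application of the metric.
    Integer arithmetic avoids rounding problems at the boundary.
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one application of the metric is required")
    passed = sum(1 for ok in outcomes if ok)
    return passed * 100 >= p_percent * total

# Example from the text: 100 components, P = 80%.
exactly_80 = meets_repeatability([True] * 80 + [False] * 20, 80)
only_79 = meets_repeatability([True] * 79 + [False] * 21, 80)
```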

VALIDATION PROCEDURE 

Metrics validation includes the following steps: 

o Identify the Quality Factors Sample

Draw a random sample from the metrics database.

o Identify the Metrics Sample

Draw a random sample from the same domain (e.g., same software) of
the metrics database.

o Perform Goodness of Fit Tests

Perform goodness of fit tests on the quality factor and metrics
data to identify their distributions.

o Perform a Statistical Analysis

Perform a statistical analysis using the methods listed under
Validity Criteria.



o Revalidate Metrics

Metrics validation is a continuous process. It is important to
revalidate a metric each time it is used. As the software engineering 
process changes, the validity of metrics changes. A validated metric 
may not necessarily be valid in other environments or future 
applications. A metric that has been invalidated may be valid in other 
environments or future applications. 

o Validate and Apply Metrics in Similar Environments 

There have been great disparities in results reported in the 
literature concerning 'relationships' between metrics and the 
quantities they purport to measure. For example, correlation 
coefficients of number of errors with Halstead Effort and McCabe 
Complexity differ by a factor of almost two [11]. Differences have 
also been reported with respect to specification refinement levels 
[10]. These disparities point up the need to apply metrics under 
conditions that are similar to those used to validate the metrics. 

There should be a project in which metrics data have been collected 
and validated prior to application of the metrics. This project should 
be similar to the one in which the metrics are applied, with respect 
to application, project size, software engineering environment, design 
methodology, and programming language. In other words, to the extent 
possible, conduct a controlled experiment [6]. Validation and 
application of metrics should be performed during the same phases on 
different projects. Example: if metric X is collected during the 
design phase of project A and the saved values are later validated 
with respect to quality factor Y, which is collected during the 
operations phase of project A, the metric X should be used during the 
design phase of project B to assess quality factor Y with respect to 
the operations phase of project B. 

EXAMPLE OF VALIDATING METRICS 

The following example is provided to show how to make metric 
validation tests. No inferences should be drawn from this example 
regarding the validity of these metrics for other applications. These 
metrics are used for illustrative purposes only. The results of the 
validation tests could be different for other applications. The data 
used in the validation tests were collected from actual software
projects.

Purpose of Metrics Validation 

The purpose of this validation is to determine whether cyclomatic
number (complexity (C)) and size (number of source statements (S))
metrics, either singly or in combination, could be used to assess, 
control and predict the quality factor reliability, as represented by 
the quality factor error count (E). 

Validity Criteria 

Select values of V, B, A, and P. The values of V, B, A, and P, used 
in the example are .7, .7, 20%, and 80%, respectively. 

Validation Procedure

Perform the following validation steps:

Identify the Quality Factor Sample

Draw a random sample of procedures (i.e., components), which is
summarized in Table 1, from the metrics data base, for the quality
factor reliability, which is represented by the quality factor error
count (Errors). The error counts are listed by project and procedure
in Appendix A.

Identify the Metrics Sample

Using the same procedures (i.e., components) in Table 1, identify
the metrics samples for cyclomatic number (complexity) and size
(statements). The metrics values are listed by project and procedure
in Appendix A.

Table 1

Project  Application              Procedures       Statements  Errors
                                  (with errors)

1        String Processing         11 ( 5)            136        10
2        Directed Graph Analysis   31 (12)            430        27
3        Directed Graph Analysis    1 ( 1)             13         1
4        Data Base Management      69 (13)           1021        26

Total                             112 (31)           1600        64

Number of procedures: 112 total, 31 with errors, 81 with no errors.
Number of source statements: 2007 total, 1600 included in metrics
analysis.
Language: Pascal on all projects.
Programmer: Single programmer. Same programmer on all projects.



Perform Goodness of Fit Tests 

The best fits obtained for the data are the following
distributions:

Errors:      Negative Binomial (error procedures)
Complexity:  Negative Binomial (all procedures)
Statements:  Exponential (all procedures)

Thus, this result discourages the use of statistical methods that 
depend on assumptions of normality and encourages the use of non- 
parametric methods. 

Perform a Statistical Analysis 



Perform the tests described under Validity Criteria. Significance
level and sample size are denoted by α and N, respectively; when it
is necessary to specify a critical level of α in hypothesis tests,
.05 is used.



Associativity 

1. Compute the sample linear correlation coefficient (R) for Errors
(E) and Complexity (C) and for Errors (E) and Statements (S) and
compare each R² with V = .7 [15].

Table 2

Sample Correlations (Error Procedures)
N = 31

          Complexity   Statements
Errors      .7834        .5880
α           .0000        .0005

Sample Correlations (All Procedures)
N = 112

          Complexity   Statements
Errors      .8010        .6596
α           .0000        .0000

RESULT: R² < V = .7. Fails minimum R² tests.
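The comparison in Test 1 can be checked directly from the Table 2
values. The following is a minimal sketch; the names table2 and
passes_r2_test are introduced here for illustration only.

```python
# Sample correlations (R) reported in Table 2; V is the threshold from the text.
V = 0.7
table2 = {
    ("error procedures", "Complexity"): 0.7834,
    ("error procedures", "Statements"): 0.5880,
    ("all procedures", "Complexity"): 0.8010,
    ("all procedures", "Statements"): 0.6596,
}


def passes_r2_test(r: float, v: float = V) -> bool:
    """Return True when R squared strictly exceeds the threshold v."""
    return r * r > v


r_squared = {key: r * r for key, r in table2.items()}
results = {key: passes_r2_test(r) for key, r in table2.items()}

for key in table2:
    verdict = "pass" if results[key] else "fail"
    print(f"{key[0]:>16}  {key[1]:<10}  R^2 = {r_squared[key]:.4f}  {verdict}")
```

The largest R² is obtained for Complexity over all procedures and is
still below .7, which is why every entry fails.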

2. Perform a null hypothesis test H0: ρ = 0 for E and C [15].

RESULT: Reject H0 with α = .0000 and N = 31.

3. Perform a null hypothesis test H0: ρ > √V = .836 for E
and C, since we want R² > V = .7 [15].

RESULT: Accept H0 with α = .01 and N = 31.
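The text does not state which statistic was used for Test 3. One
common choice is Fisher's z-transformation, sketched below under that
assumption; the function name fisher_z_test and the one-sided
formulation are ours, not taken from [15].

```python
import math
from statistics import NormalDist


def fisher_z_test(r: float, n: int, rho0: float) -> tuple[float, float]:
    """Test H0: rho >= rho0 against H1: rho < rho0 via Fisher's z-transform.

    Returns (z statistic, one-sided p-value). A small p-value counts
    against H0. This is an assumed procedure, not necessarily the one
    used in the paper.
    """
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    return z, NormalDist().cdf(z)


V = 0.7
# R for Errors and Complexity over the 31 error procedures (Table 2).
z_stat, p_value = fisher_z_test(r=0.7834, n=31, rho0=math.sqrt(V))
reject_h0 = p_value < 0.01

print(f"z = {z_stat:.3f}, one-sided p = {p_value:.3f}, reject H0: {reject_h0}")
```

Under these assumptions H0 is not rejected at the .01 level, which
agrees with the reported result.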

4. Compute the partial correlation coefficients for E, C, and S. These
coefficients give the strength of the linear relationship between two
variables while controlling for the effects of the remaining variables
[15]. This is a method for controlling for the effect of size (i.e.,
when the partial correlation coefficient between E and C is computed,
the effect of S is eliminated so that the association between E and C
alone can be observed).



Table 3

Sample Partial Correlations (Error Procedures)
N = 31

              Complexity   Statements
Errors          0.64298     -0.08157
Complexity                   0.65568

RESULT: From the low R for E and S, it can be seen that Statements
contributes essentially no additional information about Errors, once
Complexity has been correlated with Errors. Also, the R for E and C
indicates the correlation between Errors and Complexity with the
effect of size (S) eliminated.
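For three variables, a first-order partial correlation can be computed
from the three pairwise correlations. The sketch below uses the E-C and
E-S values from Table 2. The C-S correlation for error procedures is
not reported in the text, so the value 0.79 is an assumption chosen
here for illustration; with it, the results land close to Table 3.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


r_ec, r_es = 0.7834, 0.5880  # reported in Table 2 (error procedures)
r_cs = 0.79                  # ASSUMED: not reported for error procedures

r_ec_given_s = partial_corr(r_ec, r_es, r_cs)  # E and C, S held fixed
r_es_given_c = partial_corr(r_es, r_ec, r_cs)  # E and S, C held fixed
r_cs_given_e = partial_corr(r_cs, r_ec, r_es)  # C and S, E held fixed

print(f"E,C | S: {r_ec_given_s:.3f}")
print(f"E,S | C: {r_es_given_c:.3f}")
print(f"C,S | E: {r_cs_given_e:.3f}")
```

The near-zero value for E and S given C illustrates the point made in
the RESULT: once Complexity is accounted for, Statements adds little.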

5. Compute a confidence interval of ρ for E and C [15].
RESULT: .593 < ρ < .891 with α = .05 and N = 31.
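A confidence interval of this kind is commonly obtained with Fisher's
z-transformation. The sketch below assumes that method (the text does
not name it) and comes close to the reported limits; fisher_ci is a
name introduced here.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided (1 - alpha) confidence interval for rho via Fisher's z."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# R for Errors and Complexity over the 31 error procedures (Table 2).
low, high = fisher_ci(r=0.7834, n=31, alpha=0.05)
print(f"{low:.3f} < rho < {high:.3f}")
```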

Tests 3, 4 and 5 provide additional useful information about linear
correlation but they are not part of the required validation
procedure.

6. Perform a Factor Analysis

Note: In this section a factor is defined as follows:

X_j = λ_1j F_1 + λ_2j F_2 + ... + λ_kj F_k + U_j, where
X_j is a variable (metric), F_1, F_2, ..., F_k are factors that are
common to all the variables, U_j is a random factor unique to X_j,
and λ_1j, λ_2j, ..., λ_kj are factor loadings
(correlations between variables and factors) [9,15].

Do not confuse the use of the statistical term 'factor' with the
use of the metrics term 'quality factor'.

The objective of factor analysis is to reduce a set of metrics to a
smaller, orthogonal set of factors that can better explain the
relationships between correlated metrics. It frequently occurs that
several 'independent' variables (Complexity, Statements) that are used
to study the behavior of a dependent variable (Errors) are themselves
dependent and correlated -- the multicollinearity problem (See Table
3). Recent studies [14,17] have shown that a large number of metrics
[16] can be reduced to a small manageable set that represents the
underlying relationship between the quality factor and one or more
metrics. The method is most useful when there are many metrics. The
example that follows only involves three metrics. The mechanics of the
analysis are to attempt to identify one or more factors that contain
high loadings for a subset of the metrics in the factor, including the
quality factor, and low loadings for the remaining metrics. Then the
loadings are examined, excluding the quality factor, to see which
metrics of the candidate factors from the first step have high
loadings. These metrics would be emphasized in certain other analyses,
like regression analysis. The remaining metrics would be deemphasized
or discarded. An example is shown in Table 4, for procedures with
errors, where Factor 2 contains relatively high loadings for 'Errors'
and 'Complexity'. Table 5 shows a relatively high loading for
'Complexity'. This analysis indicates that a single metric
-- Complexity -- suffices for explaining the variance in the Errors
metric. A similar result was obtained using all procedures.





Table 4

Factor Loadings (Error Procedures)
N = 31

Metric        Factor 1    Factor 2
Errors        0.30128     0.94013
Complexity    0.68442     0.66467
Statements    0.94057     0.29572

Table 5

Factor Loadings (Error Procedures)
N = 31

Metric        Factor 1    Factor 2
Complexity    0.44002     0.89799
Statements    0.89799     0.44002



CONCLUSION: The results are mixed. Although the results of tests 2, 3
and 5 are favorable, Complexity failed mandatory Test 1. Thus,
evaluating the results conservatively, Complexity is judged to be
invalid with respect to Associativity. Statements does not perform as
well as Complexity and is invalid with respect to Associativity.
Furthermore, the factor analysis indicates that only one of the metrics
-- Complexity -- is needed.

Consistency 

1. Compute the Spearman Coefficient of Rank Correlation (r) for E
and C over all procedures with errors. Correlation is lower for E
and S than for E and C and is not shown. Compare r with B = .7 and
α with .05 [5,8].

Table 6

Spearman Rank Correlation (Error Procedures)
N = 31

          Complexity   Remarks
Errors      .5119      r < .7
α           .0051      α < .05

RESULT: The desired result is r > .7 and α < .05. Complexity does
not change consistently with changes in Errors across all procedures
with errors. Therefore Complexity is not valid with respect to
Consistency. Also, Statements is not valid with respect to
Consistency.

Discriminative Power

1. Divide the data into two sets: procedures with errors and
procedures with no errors. Rank these sets according to their C and S
values (statistical programs will do the ranking automatically) and
perform the Mann-Whitney test to see whether C and S can discriminate
between the two sets of procedures (i.e., tell the difference between
high quality and low quality software) [5,8].

RESULT: The results of the Mann-Whitney test for C and S are shown in
Table 7. The average ranks of C and S for procedures with errors are
much higher than the average ranks for procedures with no errors,
respectively. We can infer from the low probabilities of higher
statistics that C and S for procedures with errors have significantly
higher medians in the populations (i.e., that C and S could
discriminate a priori between high quality and low quality software).
Caution: a large number of ties weakens this test. There are a large
number of ties in C but not in S [5,8].

Table 7

Mann-Whitney Test: Comparison of Two Samples

Sample 1: Complexity - Procedures with errors
Sample 2: Complexity - Procedures with no errors
Average rank of first group = 85.9032 based on 31 values.
Average rank of second group = 45.2469 based on 81 values.
Large sample test statistic Z = -6.30181
Two-tailed probability of equaling or exceeding Z = 2.95465E-10
112 total observations.

Sample 1: Statements - Procedures with errors
Sample 2: Statements - Procedures with no errors
Average rank of first group = 85.2419 based on 31 values.
Average rank of second group = 45.5 based on 81 values.
Large sample test statistic Z = -5.82408
Two-tailed probability of equaling or exceeding Z = 5.76106E-9
112 total observations.
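The large-sample statistic can be approximated from the average ranks
alone. The sketch below uses the usual normal approximation to the
rank sum without a correction for ties; rank_sum_z is a name introduced
here, and the sign convention (group 1 minus its expectation) is an
assumption.

```python
import math


def rank_sum_z(avg_rank_1: float, n1: int, n2: int) -> float:
    """Large-sample Z for the rank sum of group 1, with no tie correction."""
    n = n1 + n2
    rank_sum = avg_rank_1 * n1
    expected = n1 * (n + 1) / 2          # mean rank sum under the null
    variance = n1 * n2 * (n + 1) / 12    # variance under the null, no ties
    return (rank_sum - expected) / math.sqrt(variance)


# Average ranks of the 31 error procedures, from Table 7.
z_complexity = rank_sum_z(85.9032, n1=31, n2=81)
z_statements = rank_sum_z(85.2419, n1=31, n2=81)

print(f"Complexity: |Z| = {abs(z_complexity):.2f}")
print(f"Statements: |Z| = {abs(z_statements):.2f}")
```

Both magnitudes come out somewhat below those in Table 7. A correction
for ties reduces the variance and so raises |Z|; that would be
consistent with the larger gap for Complexity, which has many ties,
although the text does not say which correction was applied.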

2a. Divide the data into four categories, as shown in Table 8,
according to a critical value of C, C_c, so that a Chi-square test
can be performed to determine whether C_c can discriminate between
procedures with errors and those with no errors. C_c is chosen to
provide at least five observations for each cell in Table 8 in order
to ensure the validity of the test. This may involve trial and error
>]. 



Table 8

Contingency Table

             Complexity   Complexity
                ≤ 3          > 3       Total
No Errors        75            6         81
Errors           10           21         31
Total            85           27        112



RESULT: The result of the Chi-square test is shown in Table 9. From
the high value of chi-square and the very small significance level
in the samples, we infer that C_c could discriminate between
procedures with errors (low quality software) and those without
errors (high quality software).

Table 9

Summary Statistics for Contingency Tables: C_c = 3

Chi-square    D.F.    Significance
44.6081        1      2.40692E-11
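The Table 9 statistics can be reproduced from the four cell counts of
Table 8 with the Pearson chi-square for a 2 x 2 table (no continuity
correction). The sketch below is for illustration; the function names
are ours.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Cell counts from Table 8: rows are No Errors / Errors,
# columns are Complexity <= 3 / Complexity > 3.
chi2_c = chi_square_2x2(75, 6, 10, 21)
p_c = p_value_df1(chi2_c)

print(f"chi-square = {chi2_c:.4f}, significance = {p_c:.5e}")
```

Because trial and error over C_c is part of the procedure, the same two
functions can be called once per candidate critical value, after
checking that every cell holds at least five observations.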

Sensitivity Analysis of Critical Value of Complexity

In order to see how good a discriminator C_c is for this
example, we observe the number of misclassifications that result for
various values of C_c: 1) Type 1 ('error procedures' classified as
'no error procedures') and 2) Type 2 ('no error procedures'
classified as 'error procedures'). This is shown in Figure 4. As
C_c increases, Type 1 misclassifications increase because an
increasing number of high complexity procedures, many of which have
errors, are classified as having 'no errors'. Conversely, as C_c
decreases, Type 2 misclassifications increase because an increasing
number of low complexity procedures, many of which have no errors,
are classified as having 'errors'. The total of the two curves
represents the 'misclassification' function. It has a minimum at
C_c = 3, which is the value given by the Chi-square test (the
Chi-square test will not always produce the optimum C_c but the
value should be close to optimum).



[Figure 4. Incorrect classifications versus critical value of
complexity]



The foregoing analysis assumes that the costs of Type 1 and Type
2 misclassifications are equal. This is usually not the case since
the consequences of not finding an error (i.e., concluding that
there is no error when, in fact, there is an error) would be higher
than the other case (i.e., concluding that there is an error when,
in fact, there is no error). In order to account for this situation,
the number of Type 1 misclassifications, for given values of C_c,
is multiplied by C1/C2 (C1/C2 = 1, 2, 3, 4, 5), which is the ratio
of the cost of Type 1 misclassification to the cost of Type 2
misclassification. These values are added to the number of Type 2
misclassifications to produce the family of five 'cost' curves shown
in Figure 5. Naturally, with the higher cost of Type 1
misclassifications taking effect, the optimum C_c (i.e., minimum
cost) decreases. However, even at C1/C2 = 5, C_c = 3 is a reasonable
choice.
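The sensitivity analysis amounts to counting the two kinds of
misclassification for each candidate critical value and weighting Type
1 by C1/C2. The sketch below shows that bookkeeping. The sample data
are hypothetical (they are not the Appendix A data), and the function
names are introduced here. For reference, Table 8 implies 10 Type 1 and
6 Type 2 misclassifications at C_c = 3.

```python
def misclassifications(metric, has_error, critical):
    """Count Type 1 and Type 2 misclassifications for one critical value.

    Type 1: a procedure with errors has metric <= critical ('no errors').
    Type 2: a procedure without errors has metric > critical ('errors').
    """
    type1 = sum(1 for m, e in zip(metric, has_error) if e and m <= critical)
    type2 = sum(1 for m, e in zip(metric, has_error) if not e and m > critical)
    return type1, type2


def best_critical_value(metric, has_error, candidates, cost_ratio=1.0):
    """Return the candidate that minimizes cost_ratio * Type 1 + Type 2."""
    def cost(critical):
        type1, type2 = misclassifications(metric, has_error, critical)
        return cost_ratio * type1 + type2
    return min(candidates, key=cost)


# HYPOTHETICAL sample: complexity values and whether each procedure had errors.
complexity = [1, 1, 1, 2, 2, 3, 5, 2, 4, 4, 5, 6, 8]
has_error = [False] * 7 + [True] * 6

candidates = range(1, 7)
equal_cost_optimum = best_critical_value(complexity, has_error, candidates)
high_cost_optimum = best_critical_value(
    complexity, has_error, candidates, cost_ratio=5.0
)

print("optimum at C1/C2 = 1:", equal_cost_optimum)
print("optimum at C1/C2 = 5:", high_cost_optimum)
```

In this made-up sample the optimum moves to a lower critical value as
C1/C2 grows, which is the behavior described for the cost curves.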

2b. Do Step 2a. for S. The Contingency Table is shown in Table 10.

Table 10

Contingency Table

             Statements   Statements
                ≤ 13         > 13      Total
No Errors        64           17         81
Errors            7           24         31
Total            71           41        112




Table 11

Summary Statistics for Contingency Tables: S_c = 13

Chi-square    D.F.    Significance
30.7658        1      2.91118E-8

RESULT: The same comments made in Step 2a. apply to S_c.
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Sensitivity Analysis of Critical Value of Size 

The same type of analysis is performed on S_c as was performed
on C_c to see how good S_c is as a discriminator of quality. The
curves of Type 1, Type 2 and total misclassifications are shown in
Figure 6, where it is seen that the optimum S_c = 15, as opposed to
S_c = 13, as given by the Chi-square analysis. The 'cost' curves are
shown in Figure 7, where again the optimal S_c decreases as C1/C2
increases. Considering the family of cost curves, S_c = 13 is a
reasonable value but S_c does not perform as well as C_c in this
example, because, whereas S_c = 15 is not optimum for any of the
cost curves, C_c = 3 is optimum for three of the five curves. This
result could be anticipated by the higher Chi-square and lower value
of significance (better ability to distinguish between high and low
quality) obtained for C in Table 9 as compared to the corresponding
values obtained for S in Table 11.

3. Perform the Kruskal-Wallis test (not shown) to ascertain whether C
and S are good discriminators with respect to given values of E (i.e.,
higher ranks of C and S for higher values of E).

RESULT: C and S were good discriminators when both procedures with 
errors and all procedures were evaluated. 

Discriminant Analysis 

Another approach to estimating and using a critical value of a
metric is to use discriminant analysis [1,15]. We briefly describe
this method more to indicate its general potential than as a method
that can be applied in this example because, unfortunately,
discriminant analysis is based on the assumption that the random
variables are normally distributed [1,15]. This is not the case for E,
C and S, as was observed from the goodness of fit tests.

In this technique, a linear function of random variables, called
the discriminant function, is found such that, when this function is
evaluated, its value can be used to classify the random variables into
one of N groups. For example, a linear function of C and S:

L = b_c C + b_s S

can be used to classify the tuple (C,S) into the 'error' group or
'no error' group depending on whether L ≥ or < L_c, the cutoff
value of the discriminant function. The coefficients of L are
determined by maximizing the ratio of the variance between the two
groups to the variance within groups, thus providing for maximum
discrimination. Using the mean values C_e and S_e of the 'error'
group, a cutoff value L_e is determined for the 'error' group.
Similarly, using the mean values C_ne and S_ne of the 'no error'
group, a cutoff value L_ne is determined for the 'no error'
group. The two values of L are combined to form a single cutoff
value L_c. A reasonable way to do this is to weight L_e by
the probability of a component being in the 'error' group (i.e., the
fraction of components in the 'error' group) and to weight
L_ne by the probability of a component being in the 'no
error' group (i.e., the fraction of components in the 'no error'
group).



As was the case with factor analysis, it was found that using
both C and S was no better than using C or S alone in the
discriminant function. If a single variable is used, a very
interesting and useful result is obtained. In this case, the
coefficients in L become b = 1; then L = C, L = S. Using this result
produces two cutoff values, L_c = mean C and L_s = mean S. Thus the
mean values of C = 2.53 or 3 (same value as obtained with Chi-square)
and S = 14.29 or 14 (value obtained with Chi-square = 13) could be
used for C_c and S_c, respectively. The great advantage of this
approach over the Chi-square technique is that mean C and mean S can
be used directly, thus obviating the need for trial and error
calculations with Chi-square.
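With a single variable and group means used as the group cutoffs,
weighting each group by its fraction of components yields the overall
sample mean. The sketch below illustrates this for Statements using
the totals of Table 1. The split of the 1600 statements between the
two groups is hypothetical, chosen only to show that the result does
not depend on it; combined_cutoff is a name introduced here.

```python
def combined_cutoff(mean_error, n_error, mean_no_error, n_no_error):
    """Weight each group's cutoff by the fraction of components in that group."""
    n = n_error + n_no_error
    return (n_error / n) * mean_error + (n_no_error / n) * mean_no_error


N_ERROR, N_NO_ERROR = 31, 81   # group sizes from Table 1
TOTAL_STATEMENTS = 1600        # statements included in the analysis (Table 1)

# HYPOTHETICAL split of the 1600 statements between the two groups.
statements_in_error_group = 800
statements_in_no_error_group = TOTAL_STATEMENTS - statements_in_error_group

s_cutoff = combined_cutoff(
    statements_in_error_group / N_ERROR, N_ERROR,
    statements_in_no_error_group / N_NO_ERROR, N_NO_ERROR,
)
overall_mean = TOTAL_STATEMENTS / (N_ERROR + N_NO_ERROR)

print(f"weighted cutoff = {s_cutoff:.2f}, overall mean = {overall_mean:.2f}")
```

Any other split of the 1600 statements gives the same weighted cutoff,
which matches the mean of 14.29 quoted for S.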

CONCLUSION: C and S are valid with respect to the Discriminative
Power criterion and either could be used to distinguish between
acceptable (C ≤ 3, S ≤ 13) and unacceptable quality (C > 3, S >
13) for this and similar applications when this data can be
collected. However, only one is needed (i.e., C is highly correlated
with S and the correlation between E and C/S (normalized) is close
to 0.0). It should be noted that it is less expensive to collect S
than C.

Tracking

Ideally we want to track a metric against a quality factor over
time for a single component (e.g., procedure). Unfortunately this type
of data is not always readily available because a time history of
corresponding quality factor and metric changes is required. This data
was not available in this example. In lieu of this data, the Spearman
Coefficient of Rank Correlation (r) can be used as a measure of the
ordering of the metric in relation to the quality factor, with project
being the 'component' (see below). Note, however, that (r) does not
have a chronological ordering. Also, while (r = 1) implies perfect
tracking, as defined previously, the converse is not true.

1. Compute the Spearman Coefficient of Rank Correlation (r) for E
and C for Projects 1, 2, and 4 separately (Project 3 is not used
because it has only one error). Correlation is lower for E and S
than for E and C and is not shown. Compare (r) with B = .7 and α
with .05. Procedures with errors are used rather than all procedures
because the latter has too many ties in the sample. Rank correlation
should not be used when there is a large number of ties. A moderate
number of ties is tolerable [5,8].

Table 12

Spearman Rank Correlations (Error Procedures)

Project 1, N = 5 (small sample size)
          Complexity   Remarks
Errors      .8250      r > .7
α           .0990      α > .05

Project 2, N = 12
Errors      .6723      r < .7
α           .0258      α < .05

Project 4, N = 13
Errors      .2522      r < .7
α           .3824      α > .05

RESULT: The desired result is r > .7 and α < .05 (i.e., indication
of non-zero correlation) for each project. Complexity does not track
changes in Errors sufficiently for any of the projects. Therefore,
Complexity is not valid with respect to Tracking. Also, Statements
is not valid with respect to Tracking.
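The rank correlation and the pass rule applied to Table 12 can be
sketched as follows. The Spearman coefficient is computed here as the
Pearson correlation of average ranks, which handles a moderate number
of ties; the function names and the small demonstration sample are
ours, and the sample is hypothetical rather than taken from Appendix A.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman r as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def tracks(r, alpha, b=0.7, alpha_max=0.05):
    """Pass rule from the text: r must exceed B and alpha must be below .05."""
    return r > b and alpha < alpha_max


# (r, alpha) pairs reported in Table 12, keyed by project number.
table12 = {1: (0.8250, 0.0990), 2: (0.6723, 0.0258), 4: (0.2522, 0.3824)}
verdicts = {project: tracks(r, a) for project, (r, a) in table12.items()}

# HYPOTHETICAL sample, used only to exercise spearman().
demo_complexity = [1, 2, 3, 4, 5]
demo_errors = [5, 6, 7, 8, 7]
r_demo = spearman(demo_complexity, demo_errors)

print(verdicts)
print(f"demo r = {r_demo:.4f}")
```

Project 1 fails on significance, Project 2 on the size of r, and
Project 4 on both, so no project passes.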

2. Subsequent to calculating (r), we were able to observe
chronologically the procedures which comprise a project, so that for
this example the project was the 'component' and the procedures that
comprise the project were 'tracked'. A runs test was conducted for
Projects 1 and 2 by assigning a '1' if M changed in the same direction
as F (i.e., tracks) and a '0' if this was not the case (does not
track). The runs test determines whether the binary sequences (runs)
are systematic (i.e., M tracks F) or would be expected by chance.

RESULT: Projects 1 and 2 failed (did not track) the runs test. 
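The coding of the binary sequence and a runs test can be sketched as
below. The text does not say which form of the runs test was used, so
the large-sample Wald-Wolfowitz approximation here is an assumption
(for samples as small as Projects 1 and 2, exact tables would normally
be preferred). Treating an unchanged value as 'does not track' is also
an assumption, and the sequences are hypothetical.

```python
import math
from statistics import NormalDist


def tracking_indicators(metric, factor):
    """Return 1 where consecutive changes in metric and factor share a sign."""
    indicators = []
    for i in range(1, len(metric)):
        metric_change = metric[i] - metric[i - 1]
        factor_change = factor[i] - factor[i - 1]
        # ASSUMPTION: no change in either series counts as 'does not track'.
        indicators.append(1 if metric_change * factor_change > 0 else 0)
    return indicators


def runs_test(bits):
    """Large-sample runs test. Returns (number of runs, z, two-sided p)."""
    n1 = sum(bits)
    n2 = len(bits) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both symbols must occur at least once")
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, z, p


# HYPOTHETICAL chronological values of a metric M and a quality factor F.
m_values = [1, 2, 3, 2, 4]
f_values = [1, 3, 2, 1, 5]
bits = tracking_indicators(m_values, f_values)

# HYPOTHETICAL binary sequence with strong clustering.
n_runs, z_stat, p_value = runs_test([1, 1, 1, 1, 0, 0, 0, 0])

print(bits)
print(f"runs = {n_runs}, z = {z_stat:.3f}, p = {p_value:.3f}")
```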

Predictability 

1. Make a scatter plot of E and C for procedures with errors to obtain 
a rough analysis of linearity [15]. 

RESULT: The dots on Figure 8 show the relationship. 



Figure 8. Regression of Errors on Complexity (horizontal axis:
Complexity Metric)

2. Perform a linear regression analysis of E on C for procedures with
errors.

a. Test whether the assumptions of linear regression analysis hold
for these data. Two of the important assumptions are: (1) E is
normally distributed for given values of C and (2) the variances of E
are equal for given values of C [15].

RESULT: For cases of C = 1 and 2, where there was an adequate sample
size, tests were conducted and it was found that neither assumption
holds. In addition, E was not normally distributed when all 112
procedures were used in the analysis. The best fit for E is a negative
binomial distribution.

b. Examine the residuals of E (difference between observed and
predicted) as a function of C [15].

RESULT: Residuals increase with increasing C. This indicates that
prediction error increases with increasing C. This is an undesirable
result since we want prediction error to be independent of C.

c. The same results were obtained in a. and b. when all procedures
were used.

3. Plot the regression model in Figure 8 for E on C for procedures
with errors. The equation is: E = .151 + .404C. The inner band is the
95% confidence interval for average E (i.e., 95% chance that, for a
given C, the estimate of average E will fall within the band) and the
outer band is the 95% prediction interval of E (i.e., 95% chance that,
for a given C, the estimate of E will fall within the band) [15]. The
fit is worse for regression of E on S (not shown).

4. Compare Observed Errors with Predicted Errors (obtained from the
regression model) and note whether Predictability < A = 20%, for P =
80% of the predictions.

RESULT: Table 13 indicates that Predictability < 20% in only 11 out of
31 cases, or 35%; the result is 16% when all procedures are used (not
shown). Fails Predictability and Repeatability tests.

Table 13

Observed   Predicted   Prediction   Predicta-     Project
Errors     Errors      Error        bility (%)

   1       0.957831     0.04217       4.21686        1
   5       2.572289     2.42771      48.55421        1
   2       2.168674    -0.16868       8.43373        1
   1       2.168674    -1.16867     116.86747        1
   1       0.957831     0.04217       4.21686        1
   1       0.554216     0.44578      44.57831        2
   1       0.554216     0.44578      44.57831        2
   1       0.554216     0.44578      44.57831        2
   3       0.957831     2.04217      68.07228        2
   3       3.379518    -0.37952      12.65060        2
   1       1.765060    -0.76506      76.50602        2
   3       2.572289     0.42771      14.25702        2
   2       0.957831     1.04217      52.10843        2
   1       1.765060    -0.76506      76.50602        2
   2       2.168674    -0.16868       8.43373        2
   1       1.765060    -0.76506      76.50602        2
   8       6.608433     1.39157      17.39457        2
   1       0.957831     0.04217       4.21686        3
   1       2.572289    -1.57229     157.22891        4
   1       2.168674    -1.16867     116.86747        4
   5       3.379518     1.62048      32.40963        4
   2       1.361445     0.63855      31.92771        4
   1       1.361445    -0.36145      36.14457        4
   1       2.975903    -1.97590     197.59036        4
   1       2.168674    -1.16867     116.86747        4
   3       1.765060     1.23494      41.16465        4
   2       2.168674    -0.16868       8.43373        4
   5       5.397590    -0.39759       7.95180        4
   1       1.765060    -0.76506      76.50602        4
   1       1.765060    -0.76506      76.50602        4
   2       1.765060     0.23494      11.74698        4
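The Predictability column of Table 13 is consistent with the absolute
prediction error expressed as a percentage of the observed error
count. The sketch below recomputes it from the observed and predicted
columns and applies the A = 20% and P = 80% thresholds; the variable
names are ours.

```python
A = 20.0   # prediction error tolerance, in percent
P = 80.0   # required share of predictions within tolerance, in percent

# Observed and predicted error counts, in the row order of Table 13.
observed = [1, 5, 2, 1, 1,
            1, 1, 1, 3, 3, 1, 3, 2, 1, 2, 1, 8,
            1,
            1, 1, 5, 2, 1, 1, 1, 3, 2, 5, 1, 1, 2]
predicted = [0.957831, 2.572289, 2.168674, 2.168674, 0.957831,
             0.554216, 0.554216, 0.554216, 0.957831, 3.379518, 1.765060,
             2.572289, 0.957831, 1.765060, 2.168674, 1.765060, 6.608433,
             0.957831,
             2.572289, 2.168674, 3.379518, 1.361445, 1.361445, 2.975903,
             2.168674, 1.765060, 2.168674, 5.397590, 1.765060, 1.765060,
             1.765060]


def predictability(obs: float, pred: float) -> float:
    """Absolute prediction error as a percentage of the observed value."""
    return abs(obs - pred) / obs * 100.0


percentages = [predictability(o, p) for o, p in zip(observed, predicted)]
within_tolerance = sum(1 for pct in percentages if pct < A)
share_within = 100.0 * within_tolerance / len(percentages)
passes = share_within >= P

print(f"{within_tolerance} of {len(percentages)} within {A:.0f}% "
      f"({share_within:.0f}%); passes: {passes}")
```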



5. Try non-linear single independent variable regression models.

RESULT: Several non-linear (e.g., exponential) regressions of E on C
for procedures with errors had lower correlation and worse fit (not
shown) than the linear model.

6. Perform multiple linear regression analysis, using E as dependent 
variable and C and S as 'independent variables'. 



a. Test whether the assumptions of the multiple regression model hold.
An important assumption of this method is that the 'independent
variables' are actually independent [15].

RESULT: The significant R between C and S of .833 for all procedures
indicates dependence.

b. Examine the residuals of E for all procedures [15].

RESULT: Residuals increase with increasing C and S, indicating that
prediction error would increase with increasing C and S -- an
undesirable result.

c. Plot the multiple regression model and compare with results of Step
3 [15].

RESULT: The plots were made but are not shown because the fit is worse
than in Step 3. For procedures with errors the regression equation is:
E = .174 + .437C - .00672S. Statements contributes little to the
relationship. The comparison between simple and multiple regression is
summarized in Table 14, where F-Ratio is a measure of goodness of fit
(generally, high value signifies good fit) and P is the percentage of
predictions that are within the prediction error tolerance (A = 20%).

Table 14

             E vs. C      E vs. C      E vs. C, S   E vs. C, S
             Error        All          Error        All
             Procedures   Procedures   Procedures   Procedures

R              .783         .801         .785         .801
F-Ratio       46.1        196.9         22.5         97.6
P for          35%          16%          35%          22%
A < 20%
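The F-Ratio row of Table 14 can be related to the R row through the
standard overall F statistic for a regression with k independent
variables and N observations. The sketch below applies that formula; it
reproduces the table to rounding, which suggests, but does not confirm,
that the reported values were computed this way. The function name is
ours.

```python
def f_ratio(r: float, n: int, k: int) -> float:
    """Overall regression F statistic from the multiple correlation R.

    r is the (multiple) correlation coefficient, n the sample size and
    k the number of independent variables.
    """
    r2 = r * r
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# (R, N, k) for the four columns of Table 14. N comes from Table 1;
# R for the single-variable fits is taken from Table 2.
columns = {
    "E vs. C, error procedures": (0.7834, 31, 1),
    "E vs. C, all procedures": (0.8010, 112, 1),
    "E vs. C, S, error procedures": (0.785, 31, 2),
    "E vs. C, S, all procedures": (0.801, 112, 2),
}

f_values = {name: f_ratio(r, n, k) for name, (r, n, k) in columns.items()}

for name, value in f_values.items():
    print(f"{name:<30} F = {value:.1f}")
```

Adding Statements leaves R almost unchanged while spending one more
degree of freedom, so F roughly halves. This is one way to see why the
multiple regression offers no advantage here.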

CONCLUSION: Neither C nor S meets the Predictability criterion, either
singly or in combination, for predicting E. Multiple regression has no
advantage over single variable regression for these data. Also, the
assumptions of both models are not satisfied. Therefore, neither C nor
S is valid with respect to Predictability.

Re-validate Metrics 

Repeat all validation tests for C and S on future projects, keeping
track of P, the Repeatability requirement (i.e., percentage of
applications a metric must pass validity tests to be certified as
valid).

Validate and Apply Metrics in Similar Environments 

The final result of the validation exercise is that C and S are 
valid only with respect to the discriminative power criterion to 
support the quality control function. To the extent practical, apply C 
and S in applications and environments on future projects that are 
similar to this one. 
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SUMMARY AND FUTURE RESEARCH 

We described a comprehensive metrics validation methodology that 
has six validation criteria, each of which supports certain quality 
functions. New criteria were defined and illustrated, including 
consistency, discriminative power, tracking and repeatability. It was 
shown that non-parametric statistical methods play an important role 
in evaluating whether metrics satisfy the validity criteria. A 
detailed example of the application of the methodology was presented. 
Although it was not an objective of our research, we found in the 
example that a single metric was sufficient to measure quality. 

Future research is needed to extend and improve the methodology by 
finding answers to the following questions: 

o To what extent are metrics that have been validated on one project, 
using our criteria, valid measures of quality on future projects -- 
both similar and different projects? 

o Can optimum values of 'V', 'B', 'A', and 'P' be determined by
balancing the 'cost' of setting the threshold of validity too high
versus the 'cost' of setting it too low in order to reduce
subjectivity in selecting these values?

o Can optimum critical values of metrics be found for the
discriminative power criterion by using the 'costs' of
misclassification in order to eliminate the calculation of these
values by trial and error?
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APPENDIX A. 

C: Complexity, S: Number of Source Statements (excluding comments)
E: Error Count
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Procedures with No Errors 
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1 1 
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2 2 

2 2 

2 2 
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2 5 

2 5 

2 3 
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4 2 

4 2 

4 2 
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Procedures with Errors 



Project C 

1 4 

1 16 

1 2 
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2 13 
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